import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer # provides the KMO sampling-adequacy test used before PCA
import warnings
warnings.filterwarnings("ignore")
import os
import statsmodels.api as sm
from statsmodels.formula.api import ols # For n-way ANOVA
from statsmodels.stats.anova import anova_lm # For n-way ANOVA
os.chdir('C:\\Users\\tmaji\\Downloads')
os.getcwd()
df=pd.read_csv('SalaryData.csv')
df.head()
df.describe()
df.info()
df.Education = pd.Categorical(df.Education)
df['Education'].value_counts()
df.Occupation = pd.Categorical(df.Occupation)
df['Occupation'].value_counts()
df.info()
df.groupby('Education').mean()
df.groupby('Occupation').mean()
df.describe()
Null hypothesis H0: Salary does not depend on educational qualification or occupation (the mean Salary is the same across groups). Alternate hypothesis H1: Salary depends on at least one of the two factors (Education or Occupation).
model = ols('Salary ~ Occupation', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table
The p-value of 0.458508 is greater than the level of significance α = 0.05.
We fail to reject the null hypothesis based on the above observation, and conclude that Occupation has no significant effect on Salary.
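The same kind of one-way comparison can be sketched with `scipy.stats.f_oneway`; the groups below are simulated for illustration (the means, spreads, and sizes are made up, not taken from SalaryData):

```python
import numpy as np
from scipy import stats

# Hypothetical synthetic groups: one-way ANOVA tests H0 that all group means are equal.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50_000, scale=5_000, size=30)  # e.g. one occupation
group_b = rng.normal(loc=50_500, scale=5_000, size=30)  # similar mean
group_c = rng.normal(loc=49_800, scale=5_000, size=30)  # similar mean

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```

With group means this close relative to the noise, the p-value is typically above 0.05 and we fail to reject H0, mirroring the conclusion above for Occupation.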
model = ols('Salary ~ Education', data=df).fit()
aov_table = sm.stats.anova_lm(model, typ=2)
aov_table
The p-value of 1.257709e-08 is smaller than the level of significance α = 0.05.
The null hypothesis is rejected based on the above observation, and we conclude that Education has a significant effect on Salary.
#Summary: the null hypothesis is rejected for Education, so Education has a
#significant effect on Salary; we fail to reject it for Occupation, so
#Occupation has no significant effect on Salary.
sns.pointplot(x = 'Education', y = 'Salary',hue='Occupation',data=df)
plt.grid()
plt.show()
sns.pointplot(x = 'Occupation', y = 'Salary',hue='Education',data=df)
plt.grid()
plt.show()
As seen from the above two interaction plots, there appears to be some interaction between the two categorical variables, since the lines are not parallel.
#perform two-way ANOVA
model = ols('Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
data = pd.read_csv('Education+-+Post+12th+Standard.csv')
data.head()
data.info()
data.describe()
Univariate Analysis
plt.figure(figsize=(20,10))
sns.boxplot(data=data)
plt.grid()
plt.show()
sns.histplot(data['Enroll'], kde=True)  # distplot is deprecated in recent seaborn
From the above figure, we can say that the Enroll variable is right-skewed.
plt.figure(figsize=(12,8))
plt.subplot(1,4,1)
sns.histplot(data['Apps'], kde=True)
plt.subplot(1,4,2)
sns.histplot(data['F.Undergrad'], kde=True)
plt.subplot(1,4,3)
sns.histplot(data['Grad.Rate'], kde=True)
plt.subplot(1,4,4)
sns.histplot(data['PhD'], kde=True)
plt.tight_layout()
plt.show()
#From the above figure we can conclude that "Applications Received" and "Full-time Undergrad" are right-skewed
#From the above figure we can conclude that "PhD" is left-skewed
#From the above figure we can conclude that "Graduation Rate" is approximately normally distributed
Multivariate Analysis
#Pairplot of all variables
sns.pairplot(data)
cor=data.corr()
cor
plt.figure(figsize=(12,12))
sns.heatmap(cor, annot=True)
In the above plot scatter diagrams are plotted for all the numerical columns in the dataset. A scatter plot is a visual representation of the degree of correlation between any two columns. The pair plot function in seaborn makes it very easy to generate joint scatter plots for all the columns in the data.
Often the variables of a data set are on different scales, e.g. one variable may be in the millions while another is only in the hundreds.
Counts: Apps, Accept, Enroll, F.Undergrad, P.Undergrad, Outstate, Room.Board, Books, Personal. Percentages: Top10perc, Top25perc, PhD, Terminal, perc.alumni. Ratio: Grad.Rate.
In this method, we convert variables with different scales of measurement onto a single scale.
StandardScaler standardizes the data using the formula (x - mean)/standard deviation.
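A quick sketch confirming that `StandardScaler` reproduces the manual (x - mean)/std computation column-wise; the toy matrix is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaled = StandardScaler().fit_transform(X)
manual = (X - X.mean(axis=0)) / X.std(axis=0)  # population std (ddof=0)

print(np.allclose(scaled, manual))  # → True
```

After scaling, each column has mean 0 and standard deviation 1 regardless of its original units.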
new_df=data.copy()
new_df
## Dropping the name feature before we scale numeric values as the same will not add any value in model building
new_df.drop(labels='Names',axis=1,inplace=True)
new_df.head()
Covariance and Correlation are two closely related statistical concepts: both measure the dependency between two random variables, but they differ in important ways. Covariance measures the extent to which two random variables change in tandem, while correlation measures how strongly they are related.
Correlation is the scaled form of covariance: it is obtained by dividing the covariance by the product of the standard deviations of the two variables. Correlation is bounded between -1 and +1, whereas covariance can take any value between -∞ and +∞. Covariance is affected by a change of scale: if the values of one variable are multiplied by a constant, the covariance changes accordingly. Correlation, by contrast, is scale-invariant and dimensionless (a unit-free measure), while covariance carries the product of the units of the two variables.
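The scaled-covariance relationship, corr(x, y) = cov(x, y) / (std(x) · std(y)), and the scale-invariance of correlation can both be verified numerically; the vectors below are arbitrary:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

cov_xy = np.cov(x, y)[0, 1]  # sample covariance (ddof=1)
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
corr_numpy = np.corrcoef(x, y)[0, 1]
print(np.isclose(corr_manual, corr_numpy))  # → True

# Rescaling x changes the covariance but leaves the correlation unchanged:
cov_scaled = np.cov(1000 * x, y)[0, 1]
corr_scaled = np.corrcoef(1000 * x, y)[0, 1]
print(np.isclose(cov_scaled, 1000 * cov_xy))   # covariance scales with x
print(np.isclose(corr_scaled, corr_numpy))     # correlation is unchanged
```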
#covariance
cov_matrix = np.cov(new_df.T)
cov_matrix
Both correlation and covariance are closely related to each other, yet they differ in important ways.
When choosing between covariance and correlation, the latter is usually preferred because it is unaffected by changes in dimension, location, and scale, and because its bounded range of -1 to +1 makes comparisons between pairs of variables across domains straightforward. An important limitation, however, is that both measures capture only linear relationships.
cat=[]
num=[]
for i in new_df.columns:
    if new_df[i].dtype == "object":
        cat.append(i)
    else:
        num.append(i)
print(cat)
print(num)
# Method 1
## Using Zscore for scaling/standardisation
from scipy.stats import zscore
data_scaled=new_df[num].apply(zscore)
data_scaled.head()
new_df.describe()
data_scaled.describe()
# Method II
## Using standardScaler for Standardisation
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(new_df[num])
data_standard=scaler.transform(new_df[num])
data_standard=pd.DataFrame(data_standard, columns=new_df[num].columns)
data_standard.describe()
# Method III Min-Max method
from sklearn.preprocessing import MinMaxScaler
# build the scaler model
scaler = MinMaxScaler().fit(new_df[num])
# transform the test test
data_minmax = scaler.transform(new_df[num])
data_minmax=pd.DataFrame(data_minmax, columns=new_df[num].columns)
data_minmax.describe()
Applying zscore and using StandardScaler give the same results.
Both scale the data so that the mean of each feature tends to 0 and the standard deviation tends to 1.
The Min-Max method ensures that the data are scaled to values in the range 0 to 1.
from sklearn.decomposition import PCA
pca = PCA(n_components = 1)
data_reduced_PC1 = pca.fit_transform(new_df)
data_reduced_PC1
#The amount of variance that each PC explains
var= pca.explained_variance_ratio_
var
# PCA should be run on standardised data; reuse the z-scored frame created above
# (data_scaled = new_df[num].apply(zscore))
# PCA
pca = PCA(n_components=1)
pca.fit_transform(data_scaled)
# Dump components relations with features:
df_PC =pd.DataFrame(pca.components_,columns=data_scaled.columns,index = ['PC-1'])
pca.components_
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all,kmo_model=calculate_kmo(new_df)
kmo_model
cov_matrix = np.cov(new_df.T)
print('Covariance Matrix \n%s' % cov_matrix)
# Step 2- Get eigen values and eigen vector
eig_vals, eig_vecs = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s' % eig_vecs)
print('\n Eigen Values \n%s' % eig_vals)
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
print("Cumulative Variance Explained", cum_var_exp)
var_exp
plt.plot(var_exp)
plt.show()
Visually, we can observe that there is a steep drop in the variance explained as the number of PCs increases. In the above scree plot:
• 48.91% of the total variation is explained by the first Principal Component, as confirmed by the scree plot. • In the scree plot, the last big drop occurs between the first and second components, so we choose the first component.
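The same variance-explained computation can be cross-checked against sklearn's PCA on synthetic standardised data; the matrix below is simulated for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with one strong correlation so PC1 dominates.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 1] += 2 * X[:, 0]
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise

# Percentage of variance explained, from the covariance eigenvalues.
eig_vals = np.linalg.eigvalsh(np.cov(X.T))
var_exp = sorted(100 * eig_vals / eig_vals.sum(), reverse=True)

# sklearn computes the same ratios internally.
pca = PCA().fit(X)
print(np.allclose(var_exp, 100 * pca.explained_variance_ratio_))  # → True
```

The common ddof factor cancels in the ratio, which is why the eigenvalue-based percentages match `explained_variance_ratio_` exactly.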